{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Pandas — Class Notes\n", "\n", "\n", "## Try me\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ffraile/computer_science_tutorials/blob/main/source/Data%20Manipulation/class%20notes/pandas_class_notes_notebook.ipynb)[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ffraile/computer_science_tutorials/main?labpath=source%2FData%20Manipulation%2Fclass%20notes%2Fpandas_class_notes_notebook.ipynb)" ], "id": "61036bb3282edcdf" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Introduction\n", "### Motivation\n", "- Pandas is the absolute foundation for data science in Python.\n", "- Essential tool if you are dealing with any kind of data." ], "id": "ef394c2ab06aff34" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Objectives\n", "- Cut through all the complexity and understand the core concepts.\n", "- High impact knowledge of how Pandas actually works.\n", "- Get you started with real-world data manipulation tasks." ], "id": "431e6b953d48eddf" }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Intro for non-programmers (I)\n", "- \"Spreadsheets on steroids\": powerful data manipulation tool (like Excel, but way more powerful).\n", "- Overwhelmed by endless lists or arrays? Pandas provides:\n", " - Convenient and robust Data structures.\n", " - Functions to filter, aggregate, and transform data easily.\n", "- \"Nervous system\" for data analysis in Python." ], "id": "57101b7a812ee514" }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Intro for non-programmers (II)\n", "- Use case examples:\n", " - Complex financial data analysis.\n", " - Cleaning messy datasets like survey results.\n", " - Combining multiple data sources." ], "id": "2aaf6511e85431eb" }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Intro for non-programmers (III)\n", "- Pandas revolves around two core data structures:\n", " - **Series**: 1D labeled array (like a column in a spreadsheet).\n", " - **DataFrame**: 2D labeled table (like an entire spreadsheet with rows and columns).\n", "\n", "- Any single column in a DataFrame is under the hood a Series.\n", "- All Series in a dataframe share the same **index** (row labels).\n", "- Columns in a DataFrame can have **different data types** (ints, floats, strings, etc).\n", "- This flexibility makes Pandas ideal for real-world datasets: Different data types in a single container.\n" ], "id": "6be0d0b0b871500b" }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### Intro for non-programmers (IV)\n", "- Pandas facilitates tne entire **data lifecycle**:\n", " - Loading data from various formats (CSV, JSON, Excel, SQL databases, etc).\n", " - Cleaning and transforming data (handling missing values, filtering rows, aggregating data, etc).\n", " - Analyzing and visualizing data (summary statistics, plots, etc).\n", "- Pandas is optimized for performance with large datasets (millions of rows).\n", "- Integrates seamlessly with other Python libraries (NumPy, Matplotlib, Scikit-learn) stay tuned for more!" ], "id": "e1ae0718690031cd" }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Agenda\n", "- Intro and agenda (20 min)\n", "- Loading data (15 min)\n", "- Querying (15 min)\n" ], "id": "7af3376ecbe0362" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup pandas\n", "Pandas is available off-the-shelf in Colab . If you are running locally, you may need to install it first:\n", "```bash\n", "pip install pandas\n", "```\n", "If not, just import:" ], "id": "869d80c840af437" }, { "cell_type": "code", "metadata": {}, "source": "import pandas as pd", "id": "4dd7e1c1d7769238", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "This ```pd```alias (de-facto standard) will become your best friend when working with data!", "id": "e51bdf77632782a3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create DataFrames and load data\n", "\n", "We are going to present two different ways to create DataFrames that cover most common use cases:\n", "1) From scratch (e.g., for small datasets or testing): Creating datasets from Python dictionaries/lists.\n", "2) Loading from JSON/CSV files (most common in real-world scenarios). Other methods are supported (e.g., Excel, SQL databases, Parquet, etc) but out of scope for this intro.\n", "\n", "#### Creating a dataframe from a dictionary\n", "Pandas can create DataFrames directly from Python dictionaries/lists.\n", "- Each key in the dictionary becomes a column\n", "- The values become the rows." ], "id": "ad19bb3ce55d9ba7" }, { "cell_type": "code", "metadata": {}, "source": [ "data = {\n", " \"id\": [1,2,3,4,5],\n", " \"name\": [\"Ada\",\"Alan\",\"Grace\",\"Linus\",\"Edsger\"],\n", " \"score\":[9.7, 8.2, 9.6, 7.5, 8.9],\n", " \"program\": [\"CS\",\"CS\",\"DS\",\"SE\",\"CS\"]\n", "}\n", "df1 = pd.DataFrame(data)\n", "df1" ], "id": "ad372ed5711ef4bf", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a dataframe from records\n", "You can also create DataFrames from a list of records (dictionaries), where each dictionary represents a row." ], "id": "e4eace260f6d4b71" }, { "cell_type": "code", "metadata": {}, "source": [ "students = [\n", " {\"id\": 101, \"name\": \"Ada Lovelace\", \"program\":\"CS\", \"score\": 9.5},\n", " {\"id\": 102, \"name\": \"Alan Turing\", \"program\":\"CS\", \"score\": 8.8},\n", " {\"id\": 103, \"name\": \"Grace Hopper\", \"program\":\"DS\", \"score\": 9.6},\n", "]\n", "df2 = pd.DataFrame.from_records(students)\n", "df2" ], "id": "870c415d4c9b0f2b", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Importing large datasets from JSON/CSV\n", "In real-world scenarios, datasets are often stored in files (locally or remotely). Pandas provides convenient functions to load data from common formats like JSON and CSV.\n", "We are going to load data from a CSV with information about diabetes patients.\n", "\n", "\n", "The dataset contains the following columns:\n", "\n", "* Pregnancies: Number of times pregnant\n", "* Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test\n", "* BloodPressure: Diastolic blood pressure (mm Hg)\n", "* SkinThickness: Triceps skin fold thickness (mm)\n", "* Insulin: 2-Hour serum insulin (mu U/ml)\n", "* BMI: Body mass index (weight in kg/(height in m)^2)\n", "* DiabetesPedigreeFunction: Diabetes pedigree function\n", "* Age: Age (years)\n", "* Outcome: Class variable (0 or 1) determining if the patient has diabetes\n", "* 268 of 768 are 1, the others are 0\n", "* Class Distribution: (class value 1 is interpreted as \"tested positive for diabetes\")\n", "\n", "The following code loads the dataset into a Pandas dataframe:" ], "id": "985cdd26e8ac3219" }, { "metadata": {}, "cell_type": "code", "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')\n", "df" ], "id": "e82302d1875caa13", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting and accessing data\n", "### Inspecting data\n", "- `df.head(n)`: first n rows (default 5)\n", "- `df.tail(n)`: last n rows (default 5)\n", "- `df.dtypes`: data types of each column\n", "- `df.info()`: summary of dataframe (rows, columns, dtypes, memory usage)\n", "- `df.describe()`: statistical summary of numeric columns" ], "id": "f563903f031b22e9" }, { "cell_type": "code", "metadata": {}, "source": "df.info()", "id": "e41398ab25d9c6d0", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "df.describe()", "id": "26152a94e369b66d", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "df.head()", "id": "847c8af6fc7f655f", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "See the numbers to the left of each row? That's the **index** (row labels). By default, it's just sequential numbers 0,1,2,... telling you the order of the row, but you can customize it, setting one of the columns as index or providing your own labels.", "id": "ea9bf4ded7fc121c" }, { "metadata": {}, "cell_type": "code", "source": [ "df1 = df1.set_index(\"id\")\n", "df1" ], "id": "e02bd0929fb17661", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Selecting rows/columns (very brief)\n", "- Column(s): Select 1 colum ```df['name']```, `df[['name','score']]`\n", "- Row slice: `df[1:4]` (by position)\n", "- loc/iloc: `df.loc[row_labels, col_labels]`, `df.iloc[row_idx, col_idx]`" ], "id": "e97a32f7e7e9972f" }, { "metadata": {}, "cell_type": "code", "source": [ "print(df1['name']) # one column\n", "print(df1[['name','score']]) # double brackets for multiple columns\n", "print(df1[1:4]) # row slice by position\n", "print(df1.iloc[:3, :2]) # first 3 rows, first 2 columns by position" ], "id": "a3de4197481866ce", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "You can also access individual cells:\n", "- `df.at[row_label, col_label]`: access by labels\n", "\n", "However, this is not very common in practice: You leverage the power of Pandas by working with entire columns/rows at once." ], "id": "848eb19159a730b6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Querying: masks, .query, string ops, isin, between\n", "Pandas provides multiple ways to filter/query data based on conditions. Here are some common methods:\n", "- Boolean masks: create a boolean Series and use it to filter rows\n", "- `.query()`: use a query string to filter rows\n", "- String operations: use `.str` accessor for string columns\n", "- `isin()`: filter rows where column values are in a list\n", "- `between()`: filter rows where column values are within a range" ], "id": "198c1f86b01bf3a4" }, { "cell_type": "code", "metadata": {}, "source": [ "high = df1[df1[\"score\"] >= 9] # Boolean mask: Filter rows with score >= 9\n", "high" ], "id": "58e346a0b4e385b4", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "diabetes = df[df[\"Outcome\"] == 1] # Boolean mask: patients with diabetes\n", "diabetes" ], "id": "c529055b6ca5bf42", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "cs = df1.query(\"program == 'CS' and score >= 8.5\") # Querying combining different mask\n", "cs" ], "id": "4063a19430275d35", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "names_with_a = df1[df1[\"name\"].str.contains(\"a\", case=False)]\n", "names_with_a" ], "id": "5570c9380bb60d2d", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "subset = df1[df1[\"program\"].isin([\"CS\",\"DS\"]) & df1[\"score\"].between(8.5, 9.8)]\n", "subset" ], "id": "757eafc007a3b4ca", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "## Playground! Try your own queries here\n", "# e.g., filter students with score < 8 or blood pressure > 80 of patients with diabetes" ], "id": "b951fdc77d1d8e4e", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": "### B-PI Cards\n- **P1**: Predict rows for `df.query(\"program == 'CS' and score >= 9\")`.\n- **P2**: Minimal fix: `df[df['program' == 'CS']]`.\n- **P3**: Predict: `df[df['name'].str.startswith(('A','G'))]`.", "id": "c9a9bc5529a0b41a" }, { "cell_type": "markdown", "metadata": {}, "source": "## Mini-challenges\n1) **CSV parsing:** modify `read_csv` to read a `;`-separated version of `csv_text` and parse `id` as integer and missing scores as `NaN`.\n2) **JSON lines:** build a JSON-lines string from `students` and load it with `pd.read_json(..., lines=True)`.\n3) **Top-K filter:** keep only the top-3 `score` students in `df` via thresholding (no `nlargest` shortcut).", "id": "633f1916841b9f7d" }, { "cell_type": "code", "metadata": {}, "source": "# Hints (not executed)\n# 1) pd.read_csv(StringIO(csv_text_semicolon), sep=';', dtype={'id':'Int64'}, na_values=['', 'NA'])\n# 2) js_lines = '\\n'.join(json.dumps(r, ensure_ascii=False) for r in students); pd.read_json(StringIO(js_lines), lines=True)\n# 3) thr = df['score'].nlargest(3).min(); top3 = df[df['score'] >= thr]", "id": "2b57aca64257603c", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": "# Takeaways\n- **JSON**: great for nested data; ensure valid JSON (no comments/trailing commas); control with `indent`, `ensure_ascii`.\n- **CSV**: plain tabular text; be explicit with delimiter/quoting; watch commas in text; use `newline=''` on write.\n- **pandas**: read JSON/CSV, inspect dtypes, select, and **query** with masks or `.query()`; string helpers like `.str.contains` are key.\n", "id": "e4711d29fd920e58" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.x" } }, "nbformat": 4, "nbformat_minor": 5 }